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Abstract 

We propose four estimators of the directed information rate between a pair of jointly stationary 
ergodic finite-alphabet processes based on universal probability assignments. The first one is a Shannon- 
McMillan-Breiman type estimator, similar to those used for estimation of other information theoretic 
quantities such as in Verdu (2005) and Cai, et al (2006). We show its almost sure and L\ convergence for 
any universal probability assignment. The other three estimators plug-in universal probability assignments 
in different functionals to smooth the outputs, and they have different merits such as nonnegativity and 
boundedness. We establish consistency of these estimators in almost sure and L\ senses, and derive 
near-optimal rates of convergence in the minimax sense under mild conditions. These estimators carry 
over directly to estimating other information measures of stationary ergodic finite-alphabet processes, 
such as entropy rate and mutual information rate, and provide alternatives with near-optimal theoretical 
performance to classical approaches in the existing literature. Guided by the theoretical results, we 
use context tree weighting (CTW) as the vehicle for the implementations of the proposed estimators. 
Experiments on synthetic and real data are presented, demonstrating the potential of the proposed schemes 
in practice and the efficacy of directed information estimation as a tool for detecting and measuring 
causality and delay. 
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ability assignment 

I. Introduction 

First introduced by Marko [1] and Massey |2), directed information arises as a natural counterpart of 
mutual information for channel capacity when causal feedback from the receiver to the sender is present. 
In and @J, Kramer extended the use of directed information to discrete memory less networks with 
feedback, including the two-way channel and the multiple access channel. Tatikonda and Mitter [5] used 
directed information spectrum to establish a general feedback channel coding theorem for channels with 
memory. For a class of stationary channels with feedback, where the output is a function of the current 
and past m inputs and channel noise, Kim [6] proved that the feedback capacity is equal to the limit 
of the supremum of the normalized directed information from the input to the output. In [7], Permuter, 
Weissman, and Goldsmith considered the capacity of discrete-time channels with feedback where the 
feedback is a time-invariant deterministic function of the output. Under mild conditions, they showed 
that the capacity is the maximum of the normalized directed information between the input and output 
sequences in the limit. Recently, Permuter, Kim, and Weissman (H showed that directed information 
plays an important role in portfolio theory, data compression, and hypothesis testing, in the presence of 
causality constraints. 

Beyond information theory, directed information is a valuable tool in biology, for it provides an 
alternative to Granger causality (£l, which has been perhaps the most widely-established means of 
identifying causal inference between two processes. In Mathai, Martins, and Shapiro [10], directed 
information was used to identify pairwise influence. Rao, Hero, States, and Engel [11] used directed 
information to test the direction of influence in gene networks. 

Since directed information has significance in various fields, it is of both theoretical and practical 
importance to develop efficient ways for estimating it. The problem of estimating information measures, 
such as entropy, relative entropy and mutual information, has been extensively studied in the literature. 
Verdii lfT2l gave an overview of universal estimation of information measures. Wyner and Ziv |[T3l 
applied the idea of Lempel-Ziv parsing to estimate the entropy rate, which converges in probability for 
all stationary ergodic processes. Ziv and Merhav [14] used Lempel-Ziv parsing to estimate relative entropy 
(Kullback-Leibler divergence) and established consistency under the assumption that the observations are 
generated by independent Markov sources. Cai, Kulkarni, and Verdii [15 ] proposed two universal diver- 
gence estimators for finite-alphabet sources, one based on the Burrows-Wheeler transform (BWT) |[l6l 
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and the other based on the context tree weighting algorithm (CTW) ifTTl . The BWT-based estimator was 
applied in universal entropy estimation in Cai, Kulkarni, and Verdii lfT8ll . while the CTW-based one was 
applied in universal erasure entropy estimation in Yu and Verdii 1191 . 



For the problem of estimating directed information, Quinn, Coleman, Kiyavashi, and Hatspoulous 11201 
developed an estimator to infer causality in ensemble neural spike train recordings. Based on parametric 
generalized linear model (GLM) assumption and stationary ergodic Markov assumption [20], they showed 
strong consistency results. Compared to GUI . Zhao, Kim, Permuter, and Weissman ETI focused on 
universal methods and showed L\ consistency for all jointly stationary ergodic process pairs with finite 
alphabet. 

As an improvement and further development of [21], the main contribution of this paper is a general 
framework for estimating information measures of stationary ergodic finite-alphabet processes, using 
"single-letter" information-theoretic functionals. Although our methods can be applied in estimating a 
number of information measures, we focus — for concreteness and relevance to emerging applications — 
on estimating the directed information rate between a pair of jointly stationary ergodic finite-alphabet 
processes. The first proposed estimator is adapted from the universal divergence estimator in lfl31 using 
the CTW algorithm, and we give a refined analysis yielding strong consistency results. We further propose 
three additional estimators in a unified framework to estimate the directed information rate between a 
pair of jointly stationary ergodic finite-alphabet processes, present both weak and strong consistency 
results, and establish near-optimal rates of convergence under mild conditions. We then employ our 
estimators on both simulated and real data, showing their effectiveness in measuring channel delays and 
causal influences between different processes. In particular, we use these estimators to observe significant 
causal influence from the Dow Jones Industrial Average to the Hang Seng Index, but relatively low causal 
influence in the reverse direction, based on the daily market data in the period from 1990 to 2011. 

The rest of the paper is organized as follows. Section II reviews some preliminaries and Section 
III presents our proposed estimators and some of their basic properties. Section IV is dedicated to 
performance guarantees for the proposed estimators, rates of convergence results under mild conditions, 
and minimax optimality. Section V shows experimental results applying the proposed estimators, both 
on simulated and real data, and demonstrates the effectiveness of these estimators in inferring delay of 
channels and causal influences between processes. Final remarks are made in Section VI and the main 
proofs are given in the Appendices. 
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II. Preliminaries 

We begin with mathematical definitions of directed information, causally conditional entropy, as well as 
universal and pointwise universal probability assignments. We then introduce the context tree weighting 
(CTW) algorithm used in our implementations of the universal estimators that are introduced in the next 
section. 

We use uppercase letters X, Y, . . . to denote random variables, and lowercase letters x,y, . . . to 
denote values they assume. We denote the n-tuple (Xi, X2, ■ ■ ■ , X n ) as X n and {x\, . . . , x n ) as 
x n . Calligraphic letters X,y, ... denote alphabets of X, Y, . . ., and \X\ denotes the cardinality of X. 
Boldface letters X, Y, . . . denote stochastic processes, and throughout this paper, they are finite-alphabet. 
Given a probability law P, P{x l ) = P{X l = x 1 } denotes the probability mass function (pmf) of X 1 
and P{xi\x l ~ l ) denotes the conditional pmf of Xi given {X 1 " 1 = x 1 ^ 1 }, i.e., with slight abuse of 
notation, Xi here is a "dummy variable" and P{xi\x l ~ l ) is the element of M.(X), the simplex in R of 
probabilities on X, representing said conditional pmf. Accordingly, P{xi\X 1 ^ 1 ) denotes the conditional 
pmf P{xi\x l ~ l ) evaluated for the random sequence X 1 ^ 1 , which is an M. (Af)-valued random vector, 
while P{Xi\X l ~ v ) is the random variable denoting the Xjth component of P(xi\X % ~ 1 ). Throughout this 
paper, log(-) means log 2 (-) and ln(-) means log e (-). 

A. Directed Information 

The directed information from X n to Y n is defined as 

n 

I(X n -»• Y n ) 4 J2 J { xi ; Y i\ Yi ~ l ) = H ( Yn ) - H(Y n \\X n ), 
i=l 

where H(Y n \\X n ) is the causally conditional entropy (3), defined as 

n 

H{Y n \\X n ) 4 ^HiYilY*- 1 ^). 
i=i 

Compared with the definition of mutual information, 

I(X n ;Y n ) = H(Y n ) - H{Y n \X n ), 

directed information has the causally conditional entropy in place of the conditional entropy. Unlike 
mutual information, directed information is not symmetric, i.e., I(Y n — > X n ) 7^ I(X n — >• Y n ) in 
general. 



October 23, 2012 



DRAFT 



SUBMITTED TO IEEE TRANSACTIONS ON INFORMATION THEORY 



5 



The following notation of causally conditional pmfs will be used throughout: 

n 

p(x n \\y n ) = l[ P (x i \x i -\y i ), 
i=i 

n 

p(x n \\y n - 1 ) = l[ P (x l \x*-\tf- 1 ). 

i=i 

It is easily verified that 

p(x n ,y n )=p(y n \\x n )p(x n \\y n - 1 ), 
and that we have the conservation law: 

I{X n ;Y n ) = I{X n -> Y n ) + /(Y"- 1 -»• X n ), (1) 

where 

7 (yn-i ^ x 11 ) = /((0, r- 1 ) -> x n ) = F(x n ) - ^ (^i^- 1 , y* -1 ) 

i=l 

denotes the reverse directed information. Other interesting properties of directed information can be found 

in m, AH- 

The directed information rate [3 ] between a pair of jointly stationary finite-alphabet processes X and 
Y is defined as 

/(X -> Y) 4 lim -7(X n -> Y n ). 

The existence of the limit can be checked as follows O: 

J(X -> Y) = lim -I{X n -> y n ) 

= lim i(#(Y n ) -tf(Y n ||X n )) 

n— >oo n 

1 n i n 

= lim - Y HiYAY 1 - 1 ) - lim - HtYAY^ 1 , X*) 

i=l i=l 

= ^(yoiyri) - #(YoI*°co, yr^), 

where the last equality is obtained via the property of Cesaro mean and standard martingale arguments, 
see HU Ch. 4, Ch. 16]. Note that the entropy rate H(Y) of the process Y is equal to H(Y \YZ^ o ), and 
the causally conditional entropy rate is defined as 

H(Y\\X) 4 Mm {l/ n )H{Y n \\X n ) = ff(Y |*°oo, Y^), (2) 
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thus, 



/(X ->■ Y) = H(Y) - H(Y\\X). 



(3) 



This identity shows that if we estimate H(Y) and H(Y||X) separately, and if both estimates converge, 
we have a convergent estimate of the directed information rate. 

B. Universal Probability Assignment 

A probability assignment Q consists of a set of conditional pmfs Q{xi\x % ~ 1 ) for every x' L ~ l G X % ~ x 
and % = 1,2,.... Note that Q induces a probability measure on a random process X (and the pmf 

Q{x n ) = Q{xi)Q(x2\xi) ■ ■ ■ Q{x n \x n ^ 1 ) on X n for each n). 

Definition 1 (Universal probability assignment) A probability assignment Q is said to be universal 
for a class & if the normalized relative entropy (Kullback-Leibler divergence) satisfies 



for every probability measure P in SP. A probability assignment Q is said to be universal (without a 
qualifier) if it is universal for the class of stationary probability measures. 

Definition 2 (Pointwise universal probability assignment) A probability assignment Q is said to be 
pointwise universal for a class & if 



for every probability measure P in A probability assignment Q is said to be pointwise universal 
(without a qualifier) if it is pointwise universal for the class of stationary ergodic probability measures. 

It is well known that there exist universal and pointwise universal probability assignments. Ornstein ll24l 
constructed a pointwise universal probability assignment and it was generalized by Algoet [25 ] to Polish 
space. Morvai, Yakowitz and Algoet ll26l used universal source codes to induce a probability assignment 
and showed the universality. Since the quantity i log gnpn is generally unbounded, a pointwise universal 
probability assignment is not necessarily universal. However, if we have a pointwise universal probability 
assignment, it is easy to construct a probability assignment that is both pointwise universal and universal. 
Let Q\{x n ) be a pointwise universal probability assignment and Q2(x n ) be the i.i.d. uniform distribution, 
then it is easy to verify that 



lim -D(P(x n )\\Q(x n )) = 




Q(x n ) = a n Q 2 (x n ) + (1 - a^QiOO 



(4) 
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is both universal and pointwise universal provided a n that decays subexponentially, for example, a n = 
1/n. For more about universal probability assignments see, for example, [27] and the references therein. 

C. Context Tree Weighting (CTW) 

One particularly celebrated sequential probability assignment, which we use in the implementations of 
the estimators described in the next section, is the context tree weighting (CTW) algorithm by Willems, 
Shtarkov, and Tjalken [ 17]. One of the main advantages of the CTW is that its computational complexity 
is linear in the block length n, and the algorithm provides the probability assignments Q directly; see 
ifTTl and ll28l . Note that while the original CTW was tuned for binary processes, it has been extended for 
larger alphabets in ||29l , an extension that we use in this paper. In our experiments with simulated data, 
we assume that the depth of the context tree is larger than the memory of the source. This assumption 
can be alleviated by the algorithm introduced by Willems [30], which we will not implement in this 
paper. 

(0,0)1 

(1,0)0 

(1.0) 1 

(1.1) 

(0,1)1 

(i,o)0 

(0,1)1 

(o,i)0 

Fig. 1. The CTW algorithm with D =3 and (x-2, x~i, xo, xi, . . . ,xs) = 00011010010. The count starts at xi, for the 
particular sequence in Fig[TJ there are 3 zeros and 1 one with a one as the context, that gives count (3,1) at the node of context 
1 in the upper right of Fig. [TJ 

An example of a context tree with a binary alphabet is shown in Fig. [TJ Each node in the tree corresponds 
to a context. Counts (ai, 02, . . . , «m) stored in node s are the number of different values emitted from the 
corresponding context in the alphabet of size M. For concreteness, assume the alphabet is {0, 1, . . . , M — 
1}. In Fig. [TJ the counts (01,02) are simply numbers of 0's and l's. Let x 1 ^ 1 denote the sequence at 
node s, which is composed of a\ of 0's, of l's, and bm of M — l's. For counts (01, 02, . . . , om), 
the Krichevsky-Trofimov probability estimate 1 311 is defined as follows: P e (0, 0, . . . , 0) = 1 and for 
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ai,a 2 , ■ ■ ■ ,a M > 0, 

a _)_ 1 

P e (ai + l,a 2 , • • • ,a M ) = ; ; — , , r , Pe{a\,a 2 , . . -,a M )- (5) 

ai + a 2 + ■ ■ ■ + olm + M/2 

For a,2, 03, . . . , a-M, the updating rules are similar. 

With a slight common abuse of notation, let s not only denote a specific node, but also its context, then 
we could refer to nodes like Is, which corresponds to context Is. We denote the Krichevsky-Trofimov 
probability estimate at node s as 

P£( xi ~ l ) =P e (ai,a 2 ,...,a M ), 

then we can write the updating rule of Krichevsky-Trofimov probability estimate in Equation ([5]) as 
follows 

elU| ' PKx- 1 ) ai + a 2 + ... + a M + M/2- W 

It is easy to verify that the Krichevsky-Trofimov conditional probability estimate P^x^x 1-1 ) (Equa- 
tion ©) is lowered bounded by since we have 

The weighted probability P^ at node s in the CTW algorithm is calculated as 



ps 



lP! + hUf=iPi s o<i(s)<d 

P! Ks) = D 



where the node is is the i th child of node s, and l(s) is the depth of node s. When we build the context 
tree from sequence x™, we add one symbol at a time. In adding symbol xt, we have to update the counts 
(a\,a 2 ,... ,clm), the estimated probability P e s , and the weighted probability for each context s of 
x t . The order of updating is from the context of the longest depth (a leaf node) to the root. 
As in ll28l . we define /3 s (x n ) for node s as 

ps( r n\ 

P s {x n ) 4 Y {X > . (8) 

For an internal node s in the updating path, if Is is in the updating path, we calculate the weighted 
conditional probability estimate at node s as 

^(x* -1 ) ■ 1 1 1 1 

P*(a; i |x ) = — ^ ^ P*(xi\x l ) + p s ^ x i-i^ + ^P w S \ x i\ x )• ( 9 ) 
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The term P^(rcj|x 4_1 ), where A denotes the root node of the context tree, gives the probability assign- 
ment in the CTW, which will be denoted as Q(x{\x ) in Section Hill Since the Krichevsky-Trofimov 
conditional probability estimate is lower bounded (Equation (O), so is Q{xi\x l ~ l ), which is used to prove 
consistency of the estimators in Section [TlTJ 

The probability assignment Q in the CTW is both universal and pointwise universal for the class of 
stationary irreducible aperiodic finite-alphabet Markov processes. For the proof of universality, see ifTTl . 
The pointwise universality is proved in Appendix A Lemma [2l 



In this section, we introduce four estimators of the directed information rate J(X — >• Y) between a 
pair of jointly stationary ergodic processes with finite alphabets X and Y, respectively. Let M(X,y) be 
the set of all probability distributions on X x y. Define / as the function that maps a joint pmf P(x,y) 
of a random pair (X,Y) to the corresponding conditional entropy H(Y\X), i.e., 



where P{y\x) is the conditional pmf induced by P(x,y). Take Q as a universal probability assignment, 
either on processes with (X x 3^-valued components, or with ^-valued components, as will be clear 
from the context. 

Recall the definition of directed information from X n to Y n 



III. Four Estimators 




(10) 



n 



I(X n ^Y n ) = ^I{X i -Y i \Y i - 1 ) = H{Y n ) -H{Y n \\X n ) 



i=l 



we give the four estimators as follows 



h(X n -> Y n ) 4 H x {Y n ) - H 1 {Y n \\X n ) 



(11) 



I 2 (X n -> Y n ) 4 H 2 (Y n ) - H 2 (Y n \\X n ) 



(12) 



jg(X" ^ Y n ) ^-^2D(Q{y i \X i ,Y i - 1 )\\Q(y i \Y i - 1 )) 

n * — ' 



(13) 



h{X n ^Y n ) ^-y^D(Q(x t+u y t+1 \X\Y l )\\Q(y l+1 \Y l )Q(x l+1 \X\Y^ 



(14) 



i=l 
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where 



H x {Y n \\X n ) 4 --logQ(Y n \\X n ) 



(15) 



n 



H 2 (Y n \\X n ) 4 - V/(Q(z m ,y i+ i|*\n) 



(16) 



i=l 



and = Hi(Y n \\$), H 2 (Y n ) = H 2 {Y n \\$). 

Recall that Q(yi\X\ Y l ~ x ) denotes the conditional pmf Q(yi\x\ y l ~ l ) evaluated for the random 
sequence {X l ,Y % ^ 1 ), and Q(Y n \\X n ) denotes the causally conditional pmf Q(y n \\x n ) evaluated for 
(X n ,Y n ). Thus, an entropy estimate such as Hi(Y n \\X n ) is a random variable (since it is a function 
of (X n ,Y n )), as opposed to entropy terms such as H(Y n \\X n ), which are deterministic and depend on 
the distribution of (X n ,Y n ). 

Note that the universal probability assignments conditioned on different data are calculated separately. 
For example, Qiy^Y 1 ^ 1 ) is not computed from Q{xi, yi\X l ~ l , Y l ~ x ), but from running the univer- 
sal probability assignment algorithm again on dataset Y l ~ l . In the case of Q(Yi\X l , Y l ~ l ), which is 
inherent in the computation of Q(Y n \\X n ), the estimate is computed from Q{xi, yi\X l ~ l , Y 1 ^ 1 ) via 



Note that I 2 and I4 are not analytically identical. In I 2 , the two terms, H 2 (Y n ) and H 2 (Y n \\X n ) are 
calculated separately, so are the probability assignments Q used in these two terms. In I4, the probability 
assignment Q is only calculated once. It is also worthy to note that J4 involves an average of Xi + \ in 
the KL divergence for each i, which makes it analytically different from 1%. 

Here is the big picture of the general ideas behind these estimators. The first estimator, I\, is calculated 
through the difference of two terms, each of which takes the form of Equation ( fT5T ). Since Shannon- 
McMillan-Breiman theorem guarantees the Asymptotic Equipartition Property (AEP) of entropy rate |[23l 
as well as directed information rate |32ll , it is natural to believe that l\ would converge to the directed 
information rate, which is proved in Appendix B. The Shannon-McMillan-Breiman type estimators have 
been widely applied in the literature of information-theoretic measure estimation, for example, divergence 
estimation by Cai, Kulkarni, and Verdu fT5l , and erasure entropy estimation by Yu and Verdii flT9l . 

Equation ( fT5T ) can be re-written in the Cesaro mean form, i.e., 



and estimators I 2 to I4 are derived through changing every term in the Cesaro mean to other functionals 



Q{Yi\X\ Y^ 1 ) = Q{X h Y i \X i ~ 1 ,Y i - 1 )/ QiX^X^ 1 ^ 1 ). 




(17) 
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of probability assignments Q. For concreteness, estimator I2 uses conditional entropy as the functional, 
estimators I3 and I4 use relative entropy. 

One disadvantage of I\ is that it has a nonzero probability of being very large, since it only averages 
over logarithms of estimated conditional probabilities, while the directed information rate that it estimates 
is known to be bounded (e.g., by log |^|). 

The estimator I2 is the universal directed information estimator introduced in 11211 . Thanks to the use 
of information-theoretic functionals to "smooth" the estimate, the absolute value of l2{X n — > Y n ) is 
upper bounded by log \y\ on any realization, a clear advantage over I\. 

The common disadvantage of 1% and I2 is that they are computed by subtraction of two nonnegative 
quantities. When there is insufficient data, or the stationary assumption is violated, I\ and I2 may generate 
negative outputs, which is clearly undesirable. In order to overcome this, ^3 and I4 are introduced, which 
take the form of a (random) relative entropy and are always nonnegative. Section IV-DI gives an example 
where I\ and I2 give negative estimates, which might be caused by the fact that stock market isn't 
stationary in short term. 

IV. Performance Guarantees 

In this section, we present consistency of the proposed estimators, mainly in the almost sure and L\ 
senses. Under some mild conditions, we derive near-optimal rates of convergence in the minimax sense. 
Proofs of the stated results are in the appendices. 

Theorem 1 Let Q be a universal probability assignment and finite-alphabet process (X, Y) be jointly 
stationary ergodic. Then 

lim h(X n -»• Y n ) = J(X -> Y) in L x . (18) 

n— >oo 

Furthermore, if Q is also a pointwise universal probability assignment, then the limit in (fT8l) holds almost 
surely as well. 

The proof of Theorem Q] is in Appendix IB-AI If (X,Y) is a stationary irreducible aperiodic finite- 
alphabet Markov process, we can say more about the performance of 1\ using the probability assignment 
in the CTW algorithm. 

Proposition 1 Let Q be the probability assignment in the CTW. If (X, Y) is a jointly stationary irre- 
ducible aperiodic finite-alphabet Markov process whose order does not exceed the prescribed maximum 
depth in the CTW, and Y is also a stationary irreducible aperiodic finite-alphabet Markov process with 
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the same order as (X, Y), then there exists a constant C\ such that 



E 



i\{X n — >• Y n ) — J(X — >■ Y) <Cin _1/2 log 



and Ve > 0, 



h{x r 



yr, 



/(X 



o^-^Oogn) 5 / 2 ^). P-a.s. 



The proof of Proposition Q] is in Appendix IB-BI 

We can establish similar consistency results for the second estimator I 2 in (fl2l ). 

Theorem 2 Let Q be a universal probability assignment, and finite-alphabet process (X, Y) be jointly 
stationary ergodic. Then 

lim I 2 (X n y n ) = J(X -»■ Y) in L x . 



The proof of Theorem [2] is in Appendix IB -CI As was the case for I\, if the process (X, Y) is a jointly 
stationary irreducible aperiodic finite-alphabet Markov process, we can say more about the performance 
of I 2 using the CTW algorithm as follows: 

Proposition 2 Let Q be the probability assignment in the CTW. If (X, Y) is a jointly stationary irre- 
ducible aperiodic finite-alphabet Markov process whose order does not exceed the prescribed maximum 
depth in the CTW, and Y is also a stationary irreducible aperiodic finite-alphabet Markov process with 
the same order as (X, Y), then 

lim I 2 (X n -> Y n ) = J(X -»■ Y) P-a.s. and in L u 

n— >oo 

and there exists a constant C 2 such that 



E 



I 2 (X n -> Y n ) - /(X -> Y) < Csn-^^logn) 3 / 2 . 



The proof of Proposition [2] is in Appendix IB-DI 

We also investigate the minimax lower bound of estimating directed information rate, and show the 
rates of convergence for the first two estimators are optimal within a logarithmic factor. Note that entropy 
rate is a special case of directed information rate if we take process Y = X, so the minimax lower bound 
also applies in the universal entropy estimation situation. Actually in the proof of proposition [3j we indeed 
reduce the general problem to entropy estimation problem to show the minimax lower bound. 

Proposition 3 Let , P(X,Y) be any class of processes that includes the class of i.i.d. processes. Then, 
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there exists a positive constant C3 such that 

inf sup E\I - I(X -+ Y)| > Csn- 1 / 2 , 
i v{x,y) 

where the infimum is over all estimators I of the directed information rate based on (X n ,Y n ). 

The proof of Proposition [3] is in Appendix IB -El Evidently, convergence rates better than 0(n _1 / 2 ) is not 
attainable even with respect to the class of i.i.d. sources and thus, a fortiori, in our setting of a much 
larger uncertainty set. 

For the third and fourth estimators, we establish the following consistency results using the CTW 
algorithm. 

Theorem 3 Let Q be the probability assignment in the CTW. If (X, Y) is a jointly stationary irreducible 
aperiodic finite-alphabet Markov process whose order does not exceed the prescribed maximum depth in 
the CTW, and Y is also a stationary irreducible aperiodic finite-alphabet Markov process with the same 
order as (X, Y), then 

lim h{X n Y n ) = J(X Y) P-a.s. and in L x . 

n— s-oo 

Theorem 4 Let Q be the probability assignment in the CTW. If (X, Y) is a jointly stationary irreducible 
aperiodic finite -alphabet Markov process whose order does not exceed the prescribed maximum depth in 
the CTW, and Y is also a stationary irreducible aperiodic finite-alphabet Markov process with the same 
order as (X,Y), then 

lim I 4 (X n Y n ) = J(X Y) P-a.s. and in L x , 

n— >oo 

The proofs of Theorem [3] and Theorem @] are in appendices IB-FI and IB-GI 

Remark 1 The properties of the CTW probability assignment we use in the proofs of Theorem [3] and 
Theorem[4]are not only universality and pointwise universality, but also lower boundedness (recall Section 
II-C). 

Remark 2 Note that the assumption that (X, Y) is a jointly stationary irreducible aperiodic finite- 
alphabet Markov process doesn't imply Y also has these properties. Suppose X is a Markov process 
of order m, Y is a hidden Markov process whose internal process is X, then it is obvious that joint 
process (X, Y) is Markov with order m, but Y is not a Markov process. In applications, it is sensible 
to assume that a process Z can be approximated by Markov processes better and better as the increase 
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of the Markov order, i.e., there exists constants C > 0, < p < 1, such that 

0<H(Z \ZZl)-H(Z)<^pK (19) 

It deserves mentioning that the exponentially fast convergence in Equation ( fT9l can be satisfied under 
mild conditions. For example, as shown in Birch |[33l , let G be a Markov process with strictly positive 
transition probabilities, and Z n = ij){G n ), then Equation ( fT9l holds. For more on this "exponential 
forgetting" property, please refer to Gland and Mevel ll34l and Hochwald and Jelenkovic l35l . 

The properties established for the proposed estimators are summarized in Table HI 

TABLE I 

Properties of the Proposed Estimators 





Support 


Rates of convergence 


h 


(— oo, oo) 


0(n -1 / 2 logn) 


h 


[-log|y|,log|y|] 


0(n- 1 /2(i ogn )3/2) 


h 


[0,oo) 




h 


[0,oo) 





V. Algorithm and Numerical Examples 

In this section, we use the context tree weighting (CTW) as the universal probability assignment 
to describe the corresponding algorithms and experiment on simulated as well as real data. The CTW 
algorithm |17 ] has a linear computational complexity in the block length n, and it provides the probability 
assignment Q directly. A brief introduction on how the CTW works can be found in Section III-CI 

For simplicity and concreteness, we explicitly describe the algorithm for computing I2. The algo- 
rithms for the other estimators are identical, except for the update rule, which is given, respectively, by 
Equations (Qj} to (fl4l) . 

We now present the performance of the estimators on synthetic and real data. The synthetic data is 
generated using Markov processes that are passed through simple channels such as discrete memory chan- 
nels (DMC), or channels with intersymbol interference. We compare the performances of the estimators 
to each other, as well as the ground truth, which we are able to analytically compute. We also extend 
the estimators to estimation of directed information with delay, and to estimation of mutual information. 
Further, we show how one can use the directed information estimator to detect delay of a channel, and to 
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Algorithm 1 Universal estimator I 2 based on the CTW algorithm 



Fix block length n and context tree depth D. 

h ^0 

for i <— 1, n do 

Zi = (xi,yi) > Make a super symbol with alphabet size |<^||y| 

end for 

for i •<— D + 1, n + 1 do 

Gather the context z|Zd for the ith symbol Zj. 

Update the context tree for every possible value of Z{. The estimated pmf Q{zi\Z' L ~ 1 ) is obtained 
along the way. 

Gather the context y % ~} D for the zth symbol y^. 

Update the context tree for every possible value of y^. The estimated pmf Q(yi\Y l ~ l ) is obtained 
along the way. 

Update I 2 as I 2 <- J 2 + f(Q(x h y^X*- 1 , Y^ 1 )) - f(Q(y l \Y i - 1 )) where /(•) is defined in 
Equation (fTOb . 
end for 

h <- h/{n - D) 



detect the "causal influence" of one sequence on another. Finally, we apply our estimators on real stock 
market data to detect the causal influence that exists between the Chinese and the US stock markets. 

A. Stationary Hidden Markov Processes 

Let X be a binary symmetric first order Markov process with transition probability p, i.e. P(X n 7^ 
X n -i\X n -i) = p. Let Y be the output of a binary symmetric channel with crossover probability e, 
corresponding to the input process X, as depicted in Fig. [2] 



, X- 1 , Xq . Xi 




Fig. 2. Section IV-AI setup: X is a binary first order Markov process with transition probability p, and Y is the output of a 
binary symmetric channel with crossover probability e corresponding to the input X. 



We use the four algorithms presented to estimate the directed information rate J(Y — > X) for the case 
where p = 0.3 and e = 0.2. The depth of the context tree is set to be 3. The simulation was performed 
three times. The results are shown in Fig. [3] As the data length grows, the estimated value approaches 
the true value for all four algorithms. 
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h(Y n -» X n ) I 2 (Y n -> X n ) 




Fig. 3. Estimation of I(Y — } X): The straight line is the analytical value. 

The true value can be simply computed analytically as 

I(Y n X n ) = H(X n ) - H{X n \\Y n ) 

n 

= Y j h{x 1 \x 1 - 1 ) - tfpq|jr-\Y 1 ') 
i=i 

n 

( = } J2 H ( X i\ X i-i) - HiXilX^Yi) 
i=i 

( = } £ H b(p) ~ (?* + ^)H 6 f - (pe + pe)ff 6 f , 

where (a) follows from the Markov property of the input process and the memorylessness of the channel 
and in (b) p denotes I — p. 

One can note from Fig. [3] that the sample paths of I 2 and I4 indeed appear to be smoother, as one 
might expect from that fact that they use the entropy and divergence functional on the pmf estimate 
Q(xi,yi\Y l ~ 1 , X 1 ^ 1 ). The first estimator is apparently the least smooth, since it uses the probability 
assignments evaluated on the sample path, and is highly sensitive to its idiosyncrasies. 
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B. Channel Delay Estimation via Shifted Directed Information 

Assume a setting similar to that in Section IV-AI a stationary process that passes through a channel, 
but now there exists a delay in the entrance of the input to the channel, as depicted in Fig. 01 



■■■X 



-i,Xq,X]_, ■ ■ ■ 










D' units delay 




A channel 


► 




with memory 













Fig. 4. Using the shifted directed information estimation to find the delay D' . 

Our goal is to find the delay D'. We use the shifted directed information I(Xd+i ~~ ^ X n ~ d ) to estimate 
D', where I(Y£ +1 -»• X n ~ d ) is defined as 

n— d 

I(Y d n +1 -> X n - d ) ^^HiX^X 1 - 1 ) - HiXilX*- 1 ,^). 

i=l 

To illustrate the idea, suppose X is a binary stationary process, and we define the binary process Y 
as follows 

Y, = X t _ D , + Xi-jy-x + W i} (20) 

where Wi ~ Bernouli(e) and addition in Equation (|20l is modulo 2. The goal is to find the delay D' 
from the observations of the processes Y and X. Note that the mutual information rate lim -I(Y n ; X n ) 
is not influenced by D' . However, the shifted directed information rate lim -^-sI(YV\^ X n ~ d ) 

J n->oo n ~ d 1 d+1 1 

is highly influenced by D' . Assuming that there is no feedback, for d < D' we have the Markov 
chain Y*+ d X^ 1 -)■ X { due to (H3, and therefore I(X5+i ^ Xn ~ d ) = °- However, for d > D', 
^■O^d+i ~^ X n ) > 0. For instance, in the channel example (f20|, if W{ = with probability 1 then 
for d > D', I{Xd+\ ~^ X n ~ d ) = H(X n ~ d ). Therefore, we can use the shifted directed information 
H Y d+i ~> xn ~ d ) to estimate D' . 

Fig. [5] depicts l2(YJ l +l — > X n ~ d ) where n = 10 6 for the setting in Fig. 01 where the input is a binary 
stationary Markov process of order one and the channel is given by ( f2Qb . The delay of the channel, D' 
is equal to 2. We use I2 to estimate the shifted directed information (all algorithms perform similarly for 
this case) where the tree depth of the CTW is set to be 6. The result in Fig. [5] show that for d < D', 
^2(^44.1 - > X n ~ d ) is very close to zero and for d > D' , l2(Y^ +l — > X n ~ d ) is significantly larger than 
zero. 
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0.5 



0.4 



0.3 



0.2 



0.1 



0^- 



n—d\ 





d 



Fig. 5. The value of Ia(Yj +1 ->■ X n ~ d ) where n = 10 6 for the setting depicted in Fig. g] with D' = 2. When d < 2, 
/aOd+i — > A rn_d ) is very close to zero and for d > 2, ^(VJ+i ~ ► -X" n ~ d ) is significantly larger than zero. 



C. Causal Influence Measurement 

There is extensive literature on detecting and measuring causal influence. See, for example, [36] for a 
recent survey of some of the common tools and approaches in biomedical informatics. One particularly 
celebrated tool, in both the life and economics sciences, for assessing whether and to what extent one 
time series influences another is the Granger causality test [9]. The idea is to model Y first as a univariate 
auto-regressive time series with error correction term Vi 

v 

Yi = J2 a i Y i-j + V h (21) 
3=1 

and then model it again using X as causal side information: 

p 

Y i = Y. h Y i-i + c i X i+i-j] + Vi (22) 

3=1 

with Vi as the new error correction term. The Granger causality is defined as 

i var(Fi) 

and the bigger it is, the more inclined the practitioner is to assert that X is causally influencing Y. It is 
a simple exercise to verify that when the process pair is jointly Gauss-Markov with evolution that obeys 
both Equations (T2TT) and (1221 with p = oo, the Granger causality coincides with directed information (up 
to a multiplicative constant) 11371 . 
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In this section, we implement our universal estimators of directed information to infer causal influences 
in more general scenarios, where the Gauss-Markov modeling assumption inherent in Granger causality 
fails to adequately capture the nature of the data. 

One philosophical basis for causal analysis is that when we measure causal influence between two 
processes, X and Y, there is an underlying assumption that Xi happens earlier than Yi for every (Xj, Y^). 
Under this assumption, we say two jointly distributed processes X and Y induce a forward channel 
P(yi\x l , y 1 ' 1 ) and a backward channel P{xi\x 1 " 1 , y l ~ l ), as depicted in Fig. |6j where X is the input 
process. In this section we present the use of directed information, reverse directed information, and 
mutual information to measure the causal influence between two processes. 



forward channel 



Y, 



backward channel 



Pixilx*- 1 ,^- 1 ) 



Delay 



Fig. 6. Modeling any two processes using forward channel P(yi\x', y 1 1 ) and backward channel P(xt\x' 1 ,y' ' 



X: 



BSC(a) 



Y 



BSC03) 



Yi. 



Delay 



Fig. 7. Simulation of a sequence of random variables {Xi,Yi}i > 1 according to the relation shown in the scheme. Namely, 
Yi is the output of a binary symmetric channel with parameter a and input Xi and Xi is the output of a binary symmetric 
channel with parameter f3 and input Yi-\. The initial random variable Xi is assumed to be distributed Bernoulli(i) 



Definition 3 (Existence of a channel) We say that the forward channel does not exist if P(yi\x l , y 1 ^ 1 ) - 

P{yi\y l ~ l ) for i > 1 and similarly the backward channel does not exist if P(xi\x l ~ 1 , y 1 ^ 1 ) = P(xi\x l ~ 1 ] 
for i > 1. 



We say that existence of the forward link means that the sequence Y is "influenced" or "caused" by 
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the process X. Similarly, existence of the backward link means that X is "influenced" or "caused" by 
the sequence Y. We want to answer the following two questions: 

1) Does the forward channel exist? 

2) Does the backward channel exist? 

Directed information can naturally answer these questions. It is straightforward from the definition 
of directed information to note that the forward link exists if and only if I(X n — > Y n ) > and 
the backward link exists if and only if I(y n_1 — > X n ) > 0. More generally, the directed information 
I(X n — > Y n ) quantifies how much X influences Y, while the directed information in the reverse direction 
I(Y n ~ 1 — > X n ) quantifies how much Y influences X. The mutual information, which is the sum of 
those two directed informations, (Equation £[])), quantifies the mutual influence of the two sequences. 
Therefore, using the directed information measures, it is natural to adopt terminology as follows: 

Case A: I(X n -> Y n ) > /(F n_1 -> X n ), we say that X causes Y 

Case B: I(X n -> Y n ) < I(Y n - 1 -»• X n ), we say that Y causes X 

Case C: I(X n — > Y n ) ~ I(Y n ^ 1 — > X n ) S> 0, we say that the processes are mutually causing each 
other. 

Case D: I(X n ; Y n ) = 0, we say that the processes are independent of each other. 
To illustrate this idea, consider processes X and Y generated by the system that is depicted in Fig. 
13 where the forward channel is a BSC(a) and the backward channel is a BSC(/3) where < a < | 
and < P < ^. Intuitively, if a is much less than (5, then the process X is influencing Y, and if a is 
much larger than /3, the process Y is influencing X. If a and /3 have similar values then the processes 
mutually influence each other, and finally if they are both equal to i, then the processes are independent 
of each other. Note that the information-theoretic measures can be analytically calculated as in d23l-(|25l), 
and indeed if I(X n — > Y n ) > I(y n_1 — > X n ), then a < [3 and vice versa. Hence the intuition regarding 
which process influences the other is consistent with Cases A through D presented above. 

-I(X n -> Y n ) = -Y'H(Y i \Y i - l )-H(Y l \Y i - 1 ,X i ) (23) 

i=l 

= H b (aP + af3)-H b (a) (24) 
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where the terms a and f3 denote 1 — a and 1 — /3 respectively. Similarly, we have 




1 



n 



n-1 



n 



{HiXilX^ 1 ) - HiXilX*- 1 ^- 1 )) 



i=l 



H b {aP + a(3) - H b {fi) 



and 



-I(Y n ;X n ) 



-I(X n -> Y n ) + -I{Y 



n 



1 



n n 



2H b (ap + aP) - H b (P) 



H b {a). 



(25) 



Since the normalized reverse directed information is nothing but the normalized directed information 
between another pair of processes, where one is shifted, the estimators I\ to I4 can be easily adapted to 
this situation, and the convergence theorems (Theorem Q] to Theorem [4]) apply also (with the appropriate 
translations) to the reverse directed information. Finally, the normalized mutual information can be esti- 
mated once we have the normalized directed information and the normalized reverse directed information 
simply by summing them. 



Fig. [8] depicts the estimated and analytical information-theoretic measures ^I(X n — > Y n ), ^KY 71 ^ 1 — > 
X n ), and ^ i I(X n ;Y n ) for the case a = 0.1 and f} = 0.2. One can note that with just a few hundreds 
of samples, directed information and reverse directed information start strongly indicating that a < (3, in 
other words, X influences Y more than the other way around. 

D. Causal Influence in Stock Markets 

Here we use the history data of the Hang Seng Index (HSI) and the Dow Jones Index (DJIA) between 
1990 and 2011 to compute the directed information rate between these two indexes. The data of those 
two indexes are presented in Fig. [9] on a daily time scale. 

There is no time overlap between the stock market in Hong Kong and that in New York, that is, 
when the stock market in Hong Kong is open, the stock market in New York is closed and vice versa. 
Therefore the causal influence between the markets is well defined. Since the value of the stock market 
is continuous, we discretize it into three values: —1, 0, and 1. Value —1 means that the stock market 
went down in one day by more than 0.8%, value 1 means that the stock market went up in one day by 
more than 0.8%, and value means that the absolute change is less than 0.8%. 

We denote by Xi and Yi the (quantized ternary valued) change in the HSI and the DJIA in day 
i, respectively, and estimate the normalized mutual information ^I(X n ;Y n ), the normalized directed 
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Alg. 1 




10 2 10 3 10 4 10 5 



n 

Fig. 8. The information-theoretic measures ^I(X n — > Y n ), 
algorithms. The data was generated according to the setting ir 
the analytical value given by eq. J231>-d25 1 and the blue lines a 



Alg. 2 

0.6 F ' 




10 2 10 3 10 4 10 5 



n 

Ij(yn-i X n^ and evaluated using the four 

Fig. [7] where a — 0.1 and /3 — 0.2. The straight black line is 
e the estimated values. 



information \l{X n — > Y n ), and the normalized reverse directed information ^I(Y n ~ l — > X n ), using 
all four algorithms. Fig. [10] plots our estimates of these information-theoretic measures. 

Evidently, the reverse directed information is much higher than the directed information; hence there 
is significant causal influence by the DJIA on the HSI, and a low influence in the reverse direction. In 
other words, between 1990 and 2011, it was the Chinese market that was influenced by the US market 
rather than the other way around. 

It is also worthy to note that estimators 1\ and I2 do generate negative outputs as shown in Fig. [TO] It 
may be caused by various reasons, such as data insufficiency and non-stationarity of process (X, Y). In 
the case of insufficient data, we might prefer estimators J3 and J4, since they are always non-negative, 
which can be sensibly interpreted in practice. 
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Fig. 9. The Hang Seng Index (HSI) and the Dow Jones Industrial Average (DJIA) between 1990 and 2011. The goal is to 
determine which index is causally influencing the other. 



We have presented four approaches to estimating the directed information rate between a pair of jointly 
stationary ergodic finite-alphabet processes. Weak and strong consistency results have been established 
for all four estimators, in precise senses of varying strengths. For two of these estimators we established 
rates of convergence that are optimal to within logarithmic factors. The other two have their own merits, 
such as nonnegativty on every sample path. Experiments on simulated and real data substantiate the 
potential of the proposed approaches in practice and the efficacy of directed information estimation as a 
tool for detecting and quantifying causality and delay. 
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Fig. 10. Estimates of information-theoretic measures between HSI denoted by X, and DJI denoted by Y. It is clear that the 
reverse directed information is much higher than the directed information, hence it is DJI that causally influences HSI rather 
than the other way around. 



Appendix A 
Some Key Lemmas 

Here is the roadmap of the appendices. In Appendix A, we list some key lemmas without proofs, and 
in Appendix B we prove the main theorems and propositions in section [IVJ Appendix C provides proofs 
for lemmas in Appendix A. 

The first lemma is on the AEP (Asymptotic Equipartition Property) of causally conditional entropy 
rate. It was proved in |32] that the AEP for causally conditional entropy rate holds in the almost sure 
sense, and here we prove it also holds in the L\ sense. We also show rates of convergence under the 
conditions that the processes we study are jointly stationary irreducible aperiodic Markov processes. 

Lemma 1 Let (X, Y) be a jointly stationary ergodic finite-alphabet process, then the AEP for causally 
conditional entropy rate holds 

- - \ogP(Y n \\X n ) -> F(Y||X) P-a.s. and in L x . (26) 
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Further, if (X, Y) is a jointly stationary irreducible aperiodic Markov process, then 



E 

and 



--\ogP(Y n \\X n ) -iT(YIIX) 
n 



0(n- 1/2 logn), (27) 



- - \ogP(Y n \\X n ) -»■ F(Y||X) = o(n" 1/2 (log n) 5/2+e ) P-a.tf. Ve > 0. (28) 
n 

Lemma 2 Let X fre a stationary irreducible aperiodic finite-alphabet Markov process whose order is 
bounded by the prescribed depth of the context tree in the CTW. If Q is the probability assignment in the 
CTW, then the conditional probability assignment converges to the true conditional probability almost 
surely, i.e., 

Q(x n+ i\X n ) - P(x n+ i\X n ) — > P-a.s. as n — y oo 
Remark 3 Lemma [2] partially relies on the proof of Theorem 2 lff5l . 

Lemma 3 [Lemma I, H2I\I 1 For any e > 0, there exists K e > such that for all P and Q in M(X,y): 

\f(P)-f(Q)\<Z + Ke\\P-Q\\l, 

where \\ ■ ||i is the l\ norm (viewing P and Q as \X\\y\- dimensional simplex vectors), and f is defined 
in Equation riiOD . 

Lemma 4 Let P,Q be two probability mass functions in M(X,y), denote 9 = \\P — Q\\i, if 9 < 1/2, 
we have 

|/(P)-/(Q)|<201og£ 



where f is defined in Equation diOD . 

Lemma 5 Let X be a stationary irreducible aperiodic finite-alphabet Markov process. For fixed i > 1, 
let random variable Vi(X\_ m ) be a deterministic function of random vector X\_ m , where m is the 
Markov order. Suppose V{ is uniformly bounded by constant V for any i, and EV^ = 0,Vi > 1, then 
there exists a constant C4 such that 

2 



Lemma 6 (Breiman's generalized ergodic theorem) [38] Let X be a stationary ergodic process. If 
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lim <7fc(X) — > g(X) P-a.s., and E[sup|<?fc|] < oo, then 

k—too J. 



1 n 

lim - V g k (T k (X)) = Eg(X) P-a.s. 



n— >>oo n 

k=l 

where T(-) is the shift operator which increases the index by 1, and T k increases the index by k. 

Here we paraphrase a result from [23 on the redundancy bounds of the CTW universal probability 
assignment. 

Lemma 7 H29V Let X be a stationary finite -alphabet Markov process whose order doesn 't exceed the 
prescribed maximum depth in the CTW algorithm. Let Q be the universal probability assignment generated 
by the CTW algorithm, P be the true probability law under X, then there exist constants C5 , Cq such 
that the pointwise redundancy is bounded as 

max I lo S 7T7-^T ~ lo S FTT^Tn I < C 5 log n + C 6 (29) 



Q(x n ) P(x n ) 

where C5 > 0, C% depend on nothing but the parameters specifying the process X. In particular, taking 
expectation over the inequality with respect to P, the redundancy is bounded as 

D(P(x n )\\Q(x n )) < CVogn + Qs. (30) 

Remark 4 The constants C5, Cq can be specified once the parameters of process X are given. For 
example, see 11291 , where 

Cj _ (7 - 1)|5| 



2 |5| 1 1 V7- 1 7 7- 1 

Here 7 is the size of alphabet, in this case 7 = \S\ is the number of states in the Markov process, 
given Markov order m, \S\ < \X\ m . 

Appendix B 
Proofs of Theorems and Propositions 

For brevity, in the sequel we denote H 1 {Y n \\X n ) by H x , H 2 (Y n \\X n ) by H 2 , h{X n Y n ) by 
4 1 = 1,2,3, 4. 



October 23, 2012 



DRAFT 



SUBMITTED TO IEEE TRANSACTIONS ON INFORMATION THEORY 



27 



A. Proof of Theorem [7] 

Briefly speaking, we need to show estimator I\ converges to the corresponding directed information 
rate J(X — > Y) for any jointly stationary ergodic process (X, Y). Since 1% is defined in Equation (fTTT) 
as H\(Y n ) — Hi(Y n \\X n ), if we can show the corresponding convergence properties of H\(Y n \\X n ), 
then we have the desired convergence properties of I\ since H\(Y n ) = Hi{Y n \\$). 

Given Q is a universal probability assignment, first we show I\ converges in L\. Then we show given 
Q is a pointwise universal probability assignment, 1\ also converges almost surely. 

1) L\ convergence: We decompose 



H 1 -H(Y\\X) = C n + D n , 



(31) 



where 



C n = H 1 + -\ogP{Y n \\X* 
n 



D, 



n 



\ogP(Y n \\X n ) -H(Y\\X). 



(32) 
(33) 



According to Lemma Q] shown in Appendix A, we know D n converges to zero in L\. Now we deal with 
C n . Pinsker ll39l proved the existence of a universal constant T > such that 



D(P\\Q) <E P 



log( 



dQ' 



<D{P\\Q)+Y^D(P\\Q), 



(34) 



Barron HOI simplified Pinsker's argument and proved that the constant Y = y/2 is best possible when 
natural logarithms are used in the definition of D(P\\Q). Here we follow Barron's arguments to bound 
E|C n | with C n defined in Equation (l32l . 

Denote the set {(x n ,y n ) : P(y n \\x n ) < Q(y n \\x n )} as B n , we have 



E 



1 , P(Y n \\X r 
log 



_n °Q(Y n \\X n )_ 



+ 2 y P( X ™,y«)Ilog^^ 

v n P(v n \\x r 
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Define C nl = E 



1 lncr P ( Y "II X ") I r o ^ V P(-r n n n \ 1 lno- U '. 1 wp 



bound 



a 



Til 



n 



E E 



log 



—E> 



p(Y|x»,y 



n ^ 



i=i 



E 



log 
log 



Q(Yi\X\Y^) 
PiY^X^X'- 1 ,^- 1 



X i-1 yi-l 



Q(Yi,Xi\Xi-\Yi-i) 



= -D(P(x n ,y n )\\Q(x n ,y n )), 



n 



(35) 



and Vz, consider the set of yi that makes log p[y'|^.'^.-i) positive for given (x l ,y l 1 ), define it as d 

Ciix^tf- 1 ) = { Vl : Piyilx^y 1 - 1 ) < Q{ Vi \x\ y'" 1 )}, we bound C n2 



Cn2 <-Y Y Pix^y'- 1 ) Y P(y i |sSy < - 1 )log ffl < l a; '' y * ]\ 



§-E E ^^me^w -1 )^ 



=1 



P(Y GCilxSy 4 - 1 ; 



(*>) 1 



<VX E ^ 



,i-i > 



i=l (xSj/*- 1 ) 



ln(2) 



(Q(y; g cyx 1 ,?/- 1 ) - p(y g c.ix 1 ,^- 1 )) 



1 



|Q(Y G CilxSj/*- 1 ) - P(Y G C^,^ 1 } 



(c) 1 » 

^-E E JV.y /]n(2) 

i=l (x' )W '-i) V 7 

i=i (xSj/*- 1 ) 1 ' y> 

<l± E W),^ 



i=l (xSj/*- 1 ) 



ln(2) 



<-E 



2n ^ y ln(2) 



<-E 



2n ^ V m ( 2 ) 
i=l v 



EZ)(p( y ,|x%y*-i)||Q(^|x%y*-i; 



E J D(P(y J , Yi-i)\\Q{ yi , x i+1 \X\ Y^) 



(36) 



where 

• (a) is by log-sum inequality, 
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• (b) is because of inequality log(l + x) < x/ln(2),Vx > —1, 

• (c) is because \x\ > x, 

• (d) is by the definition of total variation distance, 

• (e) is by Pinsker's inequality, 

• (f) is by the concavity of \f~-, 

• (g) is because of data processing inequality, 

• (h) is by the chain rule of Kullback-Leibler divergence, the concavity of • and data processing 
inequality. 

Combining Equation (l35l) and (l36l) . we have 

E|C n | < ^(P(^y n )||Q(^y")) + 2J^ (37) 

by definition of universal probability assignment, we show C n converges to zero in L\. Since 

E|fl"i - ff(Y||X)| < E\C n \ +E\D n \ -»• n-^oo, (38) 

we know I\ converges to 7(X — >• Y) in L\. 

2) Almost sure convergence: Consider the probability of the following event 

An,e = {(x n ,y n ) : Hi < --logP(y n \\x n ) - e}, (39) 

n 

we have 

P(A l ,e)= p (x n ,y n ) 

(i",f)eA,« 

J2 P{y n \\x n )P{x n \\y n - 1 ) 

< J2 Q(y n \\x n )2- nt P(x n \\y n - 1 ) 
= 2— Yl Q{y n \\x n )P{x n \\y n - 1 ) 

< 2~ ne , 

where the first inequality is because of the definition of even A n ^, and the last step follows from 
the fact that for any two conditional distributions of the form Q(y n \\x n ) and P(x n \\y n ~~ 1 ), we have 
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Q(y n \\x n )P(x n \\y n r ) = Q(x n ,y n ) where Q is a joint distribution. As 

oo 

J>(.A n ,e) < OO, 
n=l 

by Borel-Cantelli Lemma, we have 



lim inf H ± - ( -- log P(Y n \\X n ) ) > 0. P-a.s. (40) 



n— ¥oo \ n 



In order to get an inequality with reverse direction, write Hi — (— ^ log P{Y n \\X n )) explicitly as 

1 1 P(Y n II X n \ 

H 1 + -\ogP(Y n \\X n ) = -\og { " ' 



n b v " 1 n b Q(Y n \\X r 



1 P(Y n ,X n ) 1, Pf^llF"- 1 ) 

_ log- — i : i log- — (41) 

n ^Q(Y n ,X n ) n & Q(X n \\Y n - 1 ) , K ' 



by the definition of pointwise universality Q, we know 



1 P(Y n X n ) 
lim sup — log — j— , < 0, P-a.s. 



with a similar argument used for showing d40| ). we show 



then we have 



1 , P(X n \\Y n - 1 ) 
lim sup log r , 77— — 7T- < 0, P-a.s. 



lim sup Pi - ( -ilogP(y n ||X") 1 < 0. P-a.s. (42) 



n 



Combining Equation (02]) with d40l . 



lim Hi - (-- log P(Y n \\X n y\ = 0. P-a.s. 

n— >oo \ n J 

By Lemma \T\ shown in Appendix A, 

lim -ilogP(y n ||X n ) - P(Y||X) = 0, P-a.s. 

n— >oo n 

which implies the convergence of I\ to J(X — > Y) also holds almost surely. 

B. Proof of Proposition |7] 

For similar reasons as shown in the proof of Theorem [1] here it suffices to show the convergence 
properties of H\. For convenience, we restate some arguments shown in the proof of Theorem Q] We 
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decompose H\ — iJ(Y||X) as 

H 1 -H(Y\\X.) = C n + D n , (43) 

where 

C n = H 1 + -logP(Y n \\X n ) (44) 
n 

D n = --logP(Y n \\X n )-H(Y\\X), (45) 
n 

and we restate equation ( f37T > 



E|C n | < -D(P(x n , y n ) \\Q(x n , y n )) + 2 J y/D(P(x»+\ y^)\\Q(x^+\y^))/n. (46) 

ij Li convergence rates: We apply Lemma [7] in Appendix A. Plugging Equation d30l) of Lemma [7] 
in equation (l46l ). we have 

E|C n | = CKOogra) 1 / 2 ^ 1 / 2 ). (47) 

Combining Equation (l47l) with the L\ convergence rates of D n shown in Lemma Q] in Appendix A, 
we have 

-F(Y||X)| < E|C n | +E|D n | = 0{n- 1/2 log n), (48) 
then we know the convergence rates in Proposition Q] hold as follows 



E 



h(X n — >• Y n ) — J(X — > Y) =0(n- 1 /2i ogn ). (49) 



2 j Almost sure convergence rates: We look at the almost sure convergence rates of C n (Equation (1441 ) 
at first. We know the probability of event A n ^ defined in Equation d39l ) is bounded as 



^,e) < I'™, (50) 
take e = n~ 1+s ,8 > in Equation 09l ), we have 



.4,,, - A,, - {(x^y^-.n 1 - 5 ' (H l + ±-logP(y n \\x n )) <-n s - s '}.. 



where 5' > S > 0. Note that 

oo oo 



i=l j=l 
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by Borel-Cantelli lemma, since n s ~ s ' goes to zero as n — > oo, we proved that 

liminf n 1 ^' ( H 1 + -\ogP(y n \\x n ) ) > P-a.s. (51) 

n— >oo \ n J 

In order to get an inequality of the reverse direction, divide Equation (|4~T1) by n~ 1+<5 , we have 

n i-*' (hi + - log P(ni W) = n 1 - 5 ' f I log ^r'Tl ) ~ ^ ( ~ log ^rllr'in • (52) 
V n B v 11 7 \n s Q(Y n ,X n )J \n 6 Q(X n \\Y n ~ 1 ) J 

By the pointwise redundancy of the CTW restated in Lemma [7] in Appendix A, we know 

1 P(Y n X n ) 

HmSUp- log nfV n vn\ - 1 P " a - S -' ( 53 ^ 

n -+oo logn Q(Y n ,X n ) 

then we have 

(1 P(Y n X n )\ 
- log ' ^ \ < P-a.s. 

For the second term on the right hand side of Equation d52l ), following similar argument applied to show 
Equation d5TT >. we know 

- ^g L n < P - a " S - 

n & Q(x«||y«- 1 )y - 



then we know 



limsupn 1 -' 5 ' ( H x + - log P(Y n \\X n ) ) < P-a.s.. (54) 



1 



71 



Combining Equation (1511) and (1541) together, we know 

lim H x + - \ogP(Y n \\X n ) = o{n- 1+s ') P-a.s., W > 



(55) 



Putting (I55T ) and the almost sure convergence rates of D n shown in Lemma Q] in Appendix A together, 
we know 

h{X n -> y n ) - J(X -> Y) = o(n- 1 / 2 (bgn) 5 / 2+( ). P-a.s. Ve > 0. 

C. Proof of Theorem [2] 

It suffices to show the convergence properties of H2. We decompose 

H 2 (Y n \\X n ) - H(Y\\X) = A n + B n , 
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where 



1 n _ 

A n = -J2 f{P{xk+i,y k+ i\X\ Y k )) - (Y||X) (56) 

1 n i n 

B n = ~ Y, f(Q( x k + i,yk + i\X k , Y k )) - - V f(P(x k+1 , y k+1 \X k , Y k )). (57) 



fc=l fc=l 
A c n/'- .. ivO vO 



Define ^(X, Y) = f(P(x\, yi\X_ k , Y_ k )) for a jointly stationary and ergodic process (X, Y). Note that, 
by martingale convergence [41], g k (X, Y) ->• g(X, Y),P-a.s., where g(X,Y) = f(P(x u yilX ^, Y ^)). 
Noting further that Eg(X, Y) = i7(Y||X) and \/k,g k are bounded, we can apply Lemma [6] in Appendix 
A and get the following result: 



lim A n = P-a.s. and in L\. 

n— yoo 

Then we deal with B n defined in Equation (157T ). Fix an arbitrary e > 0, we bound 

E 



H 2 (X n ,Y n ) --y2f(P(x k+1 ,y k+1 \X k ,Y k )) 

- T(f(Q(x k+1 ,y k+1 \X k ,Y k )) - f(P(x k+1 ,y k+1 \X k ,Y k )) 
n V 
k=l 

1 n I 

< -Ey2\f(Q(x k+1 ,y k+1 \X k ,Y k )) - f(P(x k+u y k+1 \X k ,Y k )) 



E 



fc=i 



(a) 1 



< - V E e + K e \\Q(x k+1 , y k+1 \X k , Y k ) - P(x k+1 , y k+1 \X k , Y*)^ 



k=l 



~ n ^ 

k=l 

( c ) K f 



2 \u(2)D {P{x k+l ,y k+l \X\Y*)\\Q{x k+l ,y k+l \X^Y*)) 



+ e 



^21n(2)E [D (P(x k+1 ,y k+1 \X\ Y*)\\Q{x k+1 ,y k+1 \X\ Y k ))] + e 



fc=i 

Kr 



6 + — E V 21n ( 2 ) EL) (^(^+1. yfc+il^ fc . ^ fc )l \Q(xk+i, y k+ i\x k , y fc )) 



fe=i 



W r ^ /21n(2) 



n 



J2 E-D (P(x fc+ i, y fc+ i|xfe, r*)||Q(x fc+1 , y fc+ i|xfc 5 y*)) 
\ fc=l 



n 



(58) 



(59) 
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where 

• (a) comes from Lemma [3] in Appendix A, 

• (b) is due to Pinsker's inequality, 

• (c) and (d) come from the concavity of yf-, 

• (e) is because of the chain rule of the Kullback-Leibler divergence. 
We continue to bound 



lim E 



< lim E 

n—toc 



+ lim E 

n— >oo 



if) 



lim E 

n— >oo 



H 2 (X n ,Y n ) -if(Y||X) 
1 n 

H 2 (X n ,Y n ) - -Y,f(P(xk + uy k+ i\X k ,Y k )) 

k=l 

- f(P(x k+1 ,y k+1 \X k ,Y k )) - H(Y 
fc=l 

1 n 

H 2 (x n ,Y n ) --Y,f( p ( x k + i,y k+ i\x k ,Y k )) 

71 * 



k=l 



<e + lim ga / 2 ln ( 2 ) D (p( x n+l ; | / n+l)||Q( a; w+l > y n+l)) 



n 



(h) 



where (f) is because of equation d58l ); (g) comes from d59l) : (h) is due to Definition [Q Now we can use 
the arbitrariness of e to complete the proof. 



D. Proof of Proposition \2\ 

It suffices to show the convergence properties of H 2 . 

1) Almost sure convergence: For stationary ergodic process (X, Y), let 

<7 fc (X,Y) = f(Q(x ,y \Xzl)) 
5 (X, Y) = f(P(x , yo\Xzlo, y~^o)), 



by Lemma [2] in Appendix A, 



lim 9fc (X,Y)- 5 (X,Y) = P-a.s.. 

k— >oo 
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Since E[sup |<7fc|] < log \y\, by Lemma [6] in Appendix A, 



1 - - 

lim - V^(T fe (X,Y)) = lim H 2 = F(Y||X), 

i — vnn 71 — * n. — ^nn 



n— >oo n 



fc=l 



which justifies the almost sure convergence of Hz- 

2) L\ convergence rates: For convenience, we restate the definitions of A n and B n as follows 

1 n _ 

A - = -T,f( p ( x ^y^\ xk ^ Yk )) -^( y ii x ) 

fc=l 

1 n i n 

s n = - E /(QK+i,yfc+il^ fc ; ^)) - - E y fc+ il^ fc , ^))- 

fe=i fc=i 
Let y fc be f(P(x k+ i, y k+ i\X k , Y k )) - H(Y||X), V be log |^|, and apply Lemma[5]in Appendix A, we 
know 

E\A n \ < t/eaJ = 0(n- 1/2 ). (60) 

Then we bound E|P n | as follows 



E \B n \ =E 



1 n 

-J](/(g(x fe+1 ,y fc+1 |x fc ,y fc ))-/(p(x fc+1 ,y fc+1 |x fc ,y fc ))) 



fc=l 



1 n I 

< -E V /(Q(x fc+ i, y fc+ i|X fc , y fc )) - f(P(x k+1 , y k+ i\X\ Y k )) 
n / — ' I 



< V 2||P(x fe+ i, y fe+ i|X fc , y fc ) - Q(x fc+ i, y fc+ i|X fc , y fc )||x 
n f— ? 



fc=i 

n 



k=l 



X lo; 



(b) 1 
< - 

n 



x loe 



\\P(x k+ i,y k+l \X\ Y k ) - Q(x fc+1) ifc+ilA* y fc )||i 



E £ 2^2 ln(2)D(P(x fe+1) y fc+1 |Xfc, yfe)||Q(x fc+1 , y fe+1 |Xfe, y*)) 

1*1 



fc=l 



^2 H 2 )D(P(x k+1 ,y k+1 \X k ,Y k )\\Q(x k+1 ,y k+1 \X k , Y k )) 



< - E 2 \/ 2 H^D(P(x k+1 ,y k+1 \X k ,Y k )\\Q(x k+1 ,y k+1 \X k ,Y k )) 



fe=i 



x lo. 



(d) 



^2 ln(2)ED(P(x fe+ i, y fc+ i|X fc , y fc )||Q(x fc+1 , y k+l \X k , Y k )) 



< 2^/2 ln(2)D(P(x n + 1 , y n + l ) \\Q(x n + l , y n+1 )) Jn log 



^2 ln(2)D(P(x" +1 , \\Q(x n+1 , y n+1 ))/n ' 

(61) 
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where (a) is an application of Lemma [2] and Lemma [4] in Appendix A, where Lemma [2] guarantees that 
when n — > oo, the i\ norm of the difference of P{xk+i, yk+i\X k , Y k ) and Q(xk+i, yk+i\X k , Y k ) will 
be small enough so that Lemma 0] can be applied, (b) is because of Pinsker's lemma and that function 
Vtlog(t) is increasing for small t, (c) and (d) are because of the concavity of ■ and the chain rule of 
Kullback-Leibler divergence. Because of the monotonicity of \ft\og(t) when t « 0, we can plug in the 
redundancy bounds of the CTW in Lemma |7] in Appendix A, i.e., Equation (l30b into Equation (f6Tb . then 
have 

E\B n \ = 0(n _1/2 (logn) 3/2 ). (62) 
Combining Equation (l62l with (l60l ). we proved Proposition [2 

£. Proof of Proposition \3\ 

We rephrase a general lemma showing minimax lower bounds: 

Lemma 8 (Theorem 2.2, Page 90) H42V Let P be a class of models, and suppose we have observations 
Z distributed according to Vf,f G P- Let d(f,f) be the performance measure of the estimator f(Z) 
relative to the true model f. Assume also d(-, •) is a semi-distance, i.e., it satisfies 

1) d(f,g) = d(g,f)>0, 

2) d(f,f) = 0, 

3) d(f,g)<d(h,f) + d(h,g). 

Let /o, /i € T be s.t. d(fo, /i) > 2s > 0, where s is arbitrary, then 

MsupVfWJ) >s)> inf max > s) 

f feT f je{o,i} 

>iexp(-L>(P /l ||P /o )). 

Denote the binary entropy as H^ip) = —plogp — (1 — p) log(l — p) and the class of i.i.d. processes 
as A^o- Since 

H' b {p) = \og 1 -^, 
p 

and H'Jp) is decreasing in interval [2/8,3/8], we know 
Lemma 9 Vp, q G [2/8,3/8], we Ziave 

\H b (p)-H b (q)\>log(5/3)\p-q\, 
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since H' b (3/8) = log(5/3). We also show a lemma bounding the divergence between two Bernoulli pmfs. 

Lemma 10 Let P and Q be Bernoulli pmfs with parameters, respectively, 1/2-p and 1/2-q. Let \p\, \q\ < 
1/4, then D(P\\Q) < 8{p-q) 2 . 

Lemma [TOl can be verified as follows, 

D(P\\Q) = (1/2 - p) log 1^Z£ + (1/2 + p) log 1^±£ 

= (1/2 - p) log ( 1 + ^=^-1 + (1/2 + p) log (l + ' 



1/2-9/ V 1/2 + 9 



ln(2) V 1/2-9 1/2 + 9 

1 (g ~ 9) 2 
ln(2) 1/4 -g 2 

<8(p-9) 2 , 

where the first inequality is because log(l + x) < xj ln(2), Vx > — 1, and the second inequality is because 

\q\ < i/4. 

Take the observations model as Xi ' Bernoulli^), 1^ = Xj, then we have / = H{X). Take 
go = 1/4, 91 = 1/4 + 1/y/n, take I n as an estimator of I, let d(x,y) = \x — y\, we have 

d(H b (q ),H b ( qi )) > log(5/3)|9o - 9i| = log(5/3)/V^, 

then we can take s = log(5/3)/(2y / n). We have 

inf sup T> q (d(I n , I) > s) > inf max V qj {d(I n ,H h (qj)) > s) 
in Mo i n ie{o,i} 

>±eM-D(P qi \\P qo )), 

then we bound D(P qi \\P qo ). When n > 64, 



D(P qi \\P qo )=nE 1 



log' 



< 8n(g - 9i) 2 
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then 



Using Markov inequality, 



MsupV q (d(I n ,I) > s) > ^e" 8 . 



infsupE|J n - J| > infsupE|J n - I\ > ]e' s s = -e~ 8 log(5/3) ' 



In V In Mo 4 8 yJU 

F. Proof of Theorem \3\ 
We decompose 

n n 



n 

i=i y^ 



Q(yi\Y*-i) n^^^ ' 1 b Q( Vi \X\Yi-iy 



Following the proof of almost sure and L\ convergence of H2 in that of Proposition 12 we can show the 
convergence of 

to iJ(Y||X) almost surely and in L\ under the conditions of Theorem [3] 
Denote 

1 - 1 

Q{yi\Y l l ) 

it suffices to show the almost sure and L\ convergence of F n to H(Y). Decompose F n — H(Y) as 

F n — H[Y) = R n + S n , 

where 

1 n 1 n 

P « = n E E P{VllX% l0g P ^l y ' _1 ) " n £ 2 Qfal**. I*" 1 ) log QCi/il^" 1 ) (64) 

i=l s/i i=l y« 

1 n - 

2=1 2/i 

7 J Almost sure convergence: According to Lemma |2] in Appendix A, the probability assignments in the 
CTW, Q(y i \X i 1 Y 1 ' 1 ) and QiviY 1 ' 1 ) both converge almost surely to the true probability P{y i \X i , Y 1 " 1 ) 
and PiyilY^ 1 ). Denote 

Zi = - J2 Q(Vi\X\Y*- 1 ) log Q{ yi \Y 1 - 1 ) + P(yi\X\Y*- 1 ) log P(y l \Y i - 1 ), 
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we have 



lim Zi = 0. P-a.s. 

i— >oo 



Then we know the Cesaro mean of {Zi}^ =1 also converges to zero almost surely, i.e., 

1 n 

lim R n = lim — } Zi = 0, P-a.s. 



i=l 

so it suffices to show S n converges to the entropy rate of Y almost surely, which follows by Birkhoff 
ergodic theorem. 

2) L\ convergence: We express R n in Equation (|64l ) as 

n 



i=l 2/4 

and bound E|i?„,|: 



1 



EI-RJ <-VE 



i=l 
n 



^P^I^F^log 



P(y^ 



n 



^ (P^l^,^- 1 ) - Q{ yi \X\Y*- 1 )) \ogQ( yi \Y*- 1 ) 



(66) 



< 



-Ve 



i=l 



log 



p(y i |y i - 1 ) 



1 n 
+ -Ve 

n ^ 



i=l 



5> 



QPSI^- 1 ) 
l 



i \ri— 1> 



Since the probability assignment in the CTW is lower bounded, see Equation ([7])), we have 

1 



Q{yi\Y % ~ 1 ) > 



2i + \y\ ' 



then we know 



log 



1 



Q{Vi\Y % 



-<\og{2i + \y\). 
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We continue to bound 



1 n 

+ -Ve 

n ^ 



io g (2i + \y\) \P(vi\x*, y 1 - 1 ) - Q{Vi\X\ Y 1 - 1 ] 



+ log(2i + \y\) yj2\n(2)ED(P(y i \X\Y^)\\Q(y i \X\Y^)) 



n 

i=l 



< lp^mP(y l \Y l - 1 )\\Q(y l \Y 1 - 1 )) + ^^y/ED(P( yi \Yi-i)\\Q( yi \Yi-i)) 



1 n l 

+ -E log(2i + |^|)^/2 In(2)ED(P(or i , y«-i)||g(x<, j/i|-XT*, F^ 1 )) 



n 

i=l 



+ lo g (2„ + |y|),' 21n(2)D(P(l ' V! '" )ll<3(l "' ! '" )) 



where 

• (a) is because of Equation (l34l) . 

• (b) is because of Pinsker's inequality, 

• (c) is by data processing inequality, 

• (d) is by the chain rule of Kullback-Leibler divergence and concavity of \f-. 

After applying Lemma |7] in Appendix A, we know R n converges to zero in L\. By Birkhoff ergodic 
theorem, we know the convergence of S n is also in L\, which completes the proof of L\ convergence. 

G. Proof of Theorem [?] 
We decompose I4 

I a = G n — H2, 

where H2 is the estimator for i^(Y||X) in I2, G n is defined as 

n 

E o(„ +I , B+1 |x',r ) io g ^ s - M . 

1=1 (x i+1 ,y i+1 ) 
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Then we can follow the steps in the proof of Theorem [3] to establish Theorem [4] analogously. 

Appendix C 
Proofs of Technical Lemmas 

A. Proof of Lemma [7] 

1) General stationary ergodic processes: According to the Shannon-McMillan-Breiman theorem for 
causally conditional entropy rate (for example, see ll32ll ). we know the convergence holds almost surely. 
Now we prove the AEP also holds in L\. 

Denote 

A n = --logP(Y n \\X n ), 
n 

B n = -iio g p(y"||x",xO oo ,yO oo ), 

n 

where P(Y n \\X n , X^, Y ^) = Uti p ( Y i\ x -oo^-^)- Our goal is to show E\A n - H(Y\\X)\ 
converges to zero when n — > oo. Note that 



EA n = -Y\H(Y i \Y i - 1 ,X i 

n — ^ 



n . 
i=i 



EB n = H(Y X) 



and denote 

C n = B n - A n . (67) 

Since H(Yi\Y' L ~ l , X' L ) is a non-negative non-increasing sequence with respect to i, it has limit ff(Y||X), 
by the fact that EA n is the Cesaro mean of {H(Yi\Y l ~ 1 , X l )}f =1 , it follows that EA n converges to 
H(Y\\X) as n — > oo, thus we know 

lim EC n = 0. (68) 



We have 

E|A n -i?(Y||X)| =E|A n -EB n |, 

<E|C„|+E|B n -EB n |. (69) 

By Birkhoff ergodic theorem, we know E\B n — EB n \ converges to zero when n — > oo. It now suffices 
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to show lim EICU = 0. Denote the CDF of random variable C n as F n (x), then we have 

71— >00 

POO 

E\C n \ = -EC n + 2 / xdF n (x), 
Jo 



-EC n + 2 / P(C n > x)dx, 
Jo 



(70) 



where the second step follows by integration by parts and the fact that 1 — F n {x) = P(C n > x). Let 
BiX^Y^) 4 {(x n ,y n ) : P(x n , y n \X _ oo ,Y () oo ) > 0}, we have 



E 



P(Y n \\X n ) 



PW-iX^Y^) 



E 



E 



P(Y n \\X n ) 



p{Y n \\X n ,X Q _ oo: Y^ c 



yO V 



E 



P(y n \\x n ) 



-P^^^X^Y 



E 



E 



p( !/ "IK)i>(z"|| ! /"-\x° oo ,y» 



(x",i/")€.B(X° tx ,,y° 00 ) 



< J] P(y"||x")P(x"||^- 1 ) 

(a;",!/") 

= J] P(s n ,|/ n ) 



1. 



Thus, by Markov inequality, we have 



P 



P{Y n \\X n ) 



p(yn|| X « >Jr O oo) yO o 

for arbitrary positive i n . Take t n = 2 ne , we have 



> tn ) < -, 



l 



p - log — 

n S p(y«||x-,xo oo ,r 00 ) 



> e < 2" 



which implies 



> x) < 2" ra . 



(71) 



Plug Equation (1711 ) into Equation (1701 . we have 

E\C n \ = -EC n + 2 / P{C n > x)dx 
Jo 



-EC n + 



io 
2 



nln(2)' 
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By Equation (1681 ). EC n — > as n — > oo, we know 



lim E\C n \ = 0. 

n— >oo 



By Equation (1691 ), we know the AEP for causally conditional entropy holds in L\. 
2) Irreducible aperiodic Markov processes: Take 



Z t = -\ogP{Y i \X\_ m ,Y^) - tf(Y||X), 

where m is the order of Markov process (X, Y), we have 

1 n 1 _ 

-V Zi = logP(Y n ||X n ) - flYYllX). 

n / — ' n 

i=i 

Denote 



and decompose Zi as follows: 



5i = -iogP(^|xt m ,^), 



Zi = gf + sf - H L - H L ' 



where gf = g i l {lg ,\< L} , g? = g t - fff, # L = %f, = #(Y||X) - H L = Egf . We expand 

/ n \ 2 / n \ 2 / n \ 2 

then deal with the three terms on the right hand side of Equation {74]) separately. 

For the first term, we can apply Lemma [5] in Appendix A. In this case, X is (X, Y), Vi is g\ 
and V is L here. According to Lemma [5J we have 

E {j2^-H L ) =0(nL 2 ) 

For the second term, we have 

E (j2 9i' - RL ^J < ^ 2 maxE( 5 f - # L ') 2 . 

Define 

^ = {(4-m,ylj ■ K < -logP( yi |xt m) ytm) < ^ + 1}, 
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we have 



E( 5 f -H L ') 2 <E(g^') 2 

oo „ 

K=L jEiK 
oo 

< J2 \y\(K + if2- K 



K=L 

= 0{L 2 2- L ), 

where the last inequality is an inequality developed by McMillan 11431 . and the last step could be intuitively 
understood since the terms decay rapidly, the sum is dominated by largest term, hence the order. Now 
we have 

E (|>f-tf L ') =0(n 2 L 2 2- L ). 
For the third term, we apply Cauchy-Schwarz inequality, and show 



A 



E 12^ - HL 



\ i=i 



= 0(n z l 2 L 2 2- L ^ 
Sum three terms together and take L = 2 log n, we have 

n 

E\J2 z i\ 2 = 0(n(log 

thus 



ii 



(75) 



El - -\ogP(Y n \\X n ) - H(Y\\X) 

n 



i n 



i=l 



1 

< 

n 



\ i=i 
= 0(n~ 1/2 logn). 

Now we deal with the almost sure convergence rates of AEP of causally conditional entropy rate. 
Restate Gal-Koksma theorem [44] as follows: 



Lemma 11 (Gal-Koksma's theorem) Let ($7, J 7 , P) be a probability space and let (Z n ) n >\ be a se- 
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quence of random variables belonging to IP, p > 1, satisfying 



E|Zm+i + Z M+2 + . . . + Z M+n 



0(*(n)) 



uniformly in M where 



n > 1 is a nonde creasing sequence. Then for every e > 0, 



n 



Z 1 (u) + Z 2 (u) + ... + Z n {uj) = o((tf(n)(logn) p+1+e )p) P-a.s. 



The bound shown in equation (|75T ) indicates that we can take \&(n) = n(logn) 2 , p = 2 in Gal-Koksma 
theorem, then we have 



B. Proof of Lemma \2\ 

Denote the alphabet size as M = \X\. From the probability weighting procedure shown in Equation (|9]) 
we know P,^(x n+ i\X n ) is a weighted summation of all of the probability estimates along the updating 
path and the weights sum into 1, where A denotes the root. Now we argue in the updating path, part of 
the weights of probabities will go to zero, the left probability estimates will converge almost surely to 
the true conditional probability. 

Suppose s is an internal node in the true tree source, we investigate the properties of j3 s {X n ). As is 
asserted in Lemma 4 of ||T5l in the binary alphabet case, /3 s (X n ) vanishes almost surely when s is an 
internal node. Here we restate this fact and present a proof for the general finite-alphabet case. 

Lemma 12 Suppose s is an internal node in the tree representation of the source, then 



--logP(Y n \\X n ) - H(Y\\X) = o{n 



/ 2 (logn) 5 / 2+e ) P- 



'-a.s. Ve > 



(76) 



/3 s (X n ) = P-a.s.asi^oo 



Proof: 



/3 s (X n ) Pe(X n ) 



P s (X n ) + 1 2P%,(X n ) 
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Denote leaf nodes that are offsprings of node s as (t±t2 ■ ■ ■ ti-s), we have 
/fc-i \ 



13 s (X r 



< fh 

P°{X") + 1 - [l\ 

/fc-1 



Yl{t!...t k ) 



exp 



(ti...t k ) 



Since nodes (ii*2 • • • ifcs) are leaf nodes of the context tree, we know 

iiog n p^ s (x")=iiog n p^ s (x n ). 

n - L - L n - LJ - 

(tx...t k ) (tl...t k ) 

It is verified in 11311 that there is a constant C depending only on the alphabet size M such that for every 
n > 1 and sequence x n , 

, rw m 1 , „m iVfalac?) M-llogn C 

■ log P e (aff ) - - \ i\T(o aff ) log v 1 1 ; + — 2_ <_. (77) 

n n 2 n n 

Since term ~ Saey ^M^i) 1°8 N ^ a ^ 1 ^ will converge almost surely to the true entropy when n — > oo, 
we know ^ logPg (X n ) will converge to the entropy function almost surely when n — > oo. Analogously, 
term ~logJT/ ti t \ P,f v 1 '" tkS (X n ) will converge to the weighted summation of several entropy functions 
almost surely when n — > oo with weights summing into one. Thus we can use the strict concavity of the 
entropy function to show 

ilo g P e s (X«)-ilog TT P^ s (X n ) 

fl <n . J- 



(ti-t k ) 



converges to a negative constant, because we know all of the offsprings of node s cannot all have the 
same distribution (otherwise they can be absorbed into s, which is contradictory to the assumption that 
s is an internal node). With this observation, the proof is straightforward. ■ 
From Lemma [12] we know the contributions of conditional probability estimates in the internal nodes 
will go to zero almost surely. Since data collected at the leaf nodes can be viewed as they were generated 
from i.i.d. sources, by Equation (1771 ). we know the Krichevsky-Trofimov probability estimates at leaf 
nodes converge to the true probability distributions almost surely, i.e., 



Q(x n+1 \X n ) - P(x n+1 \X n ) = P^(x n+1 \X n ) - P(x n+1 \X n ) -> P-a.s. as n ^ oo 
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C. Proof of Lemma\3\ 

Fix e > 0. Since M(X,y) is bounded and closed, /(•) is uniformly continuous. Thus there exists S e 
such that \f(P)-f(Q)\ < e, if ||P-Q||i < 5 e . Furthermore, /(•) is bounded by / max = log |Af|+log \y\. 
We have 

\f(P) ~ f(Q)\ < el {||P-Q||i<5J + /maxl{||F-Q|| 1 >a e } 



where K f 



<e + f m 
fu 



< e + 



li'-Qlli 



e + K e \\P-Q\\i, 



<5 e " 



D. Proof of Lemma [?] 
Since 

we bound |/(P) - f(Q)\ as 



H{Y\X) = H(XY) - H(X), 



\f(P) - f(Q)\ = \H P (XY) - H P (X) - Hq(XY) + H Q (X)\ 
< \H P (XY) - H Q {XY)\ + \H P {X) - H Q (X)\. 



By Lemma 2.7 in [45], we have 



\H P {XY) - H Q (XY)\ < 9log 
\H P (X) - H Q {X)\ < 0i log 



\x 



\x\ 



where 9 = \\Pxy — Qxy\\i and 6\ = \\Px — Qx\\i- Here Px is the marginal distribution of X under 
Pxy- By triangle inequality, >6\, then we have 

|/(P)-/(Q)|<201og^M 
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E. Proof of Lemma \5\ 

Definition 4 (a-mixing coefficient) For stationary process X adapted to nitration (Fn)™^ the a-mixing 
coefficient is defined as 

a(n) = sup \P{A (IB) - P(A)P(B)\, A £ ^Be 

According to [46], as X is a stationary irreducible aperiodic Markov process, a(n) goes to zero 
exponentially fast with n, i.e., there exists positive constants Cj > 0, C% > such that 



a(n) < C-je 



-C 8 n 



We bound E (1/n £? =1 Vif as follows 



E 



8=1 



= ^E e n 2 + ^ E E ^ 

j=l l<i<j'<n 

<— + A E EWi + i E Cre-^l-^V 2 
n n z z — ' n z z — ' 

l<i<j<n l<«<i<" 
T/2 ot/2 

n n z / — ' 

fc=i 

^ y 2 , 2C 7 y 2 



n n(e c ' 8 — 1) ' 

where the first inequality is because we change the probability measure of random vector (V£, Vj), i ^ j to 
the product of two marginals, and then upper bound the difference between the product of two marginals 
and the true measure using the definition of a-mixing coefficient and the uniform upper bound on 

Vi,i > 1. 

Thus, we show Lemma [5] holds where C 4 = V 2 (l + 2C 7 /(e Cs - 1)). 

Remark 5 We can write CV and C$ in the proof of Lemma [5] explicitly when the transition kernel of 
the Markov process X is symmetric. For a symmetric, aperiodic, irreducible transition kernel W, Ai = 1 
is a simple eigenvalue and all other eigenvalues satisfy |A 3 -| < 1, j ^ 1. It can be shown that in this case, 
the a-mixing coefficient is bounded by 

a(n) < y/m + le ln ^ A 2^ n 
where A2 is the second largest eigenvalue of W in absolute value. 
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